406 research outputs found

    An XML-based Tool for Tracking English Inclusions in German Text

    The use of lexicons and corpora advances both linguistic research and the performance of current natural language processing (NLP) systems. We present a tool that exploits such resources, specifically English and German lexical databases and the World Wide Web, to recognise English inclusions in German newspaper articles. The output of the tool can assist lexical resource developers in monitoring changing patterns of English inclusion usage. The corpus used for the classification covers three different domains. We report the classification results and illustrate their value to linguistic and NLP research.

    The Impact of Annotation on the Performance of Protein Tagging in Biomedical Text

    In this paper we discuss five different corpora annotated for protein names. We present several within- and cross-dataset protein tagging experiments showing that different annotation schemes severely affect the portability of statistical protein taggers. By means of a detailed error analysis, we identify crucial annotation issues that future annotation projects should take into careful consideration.

    Using foreign inclusion detection to improve parsing performance

    Inclusions from other languages can be a significant source of errors for monolingual parsers. We show this for English inclusions, which are sufficiently frequent to present a problem when parsing German. We describe an annotation-free approach for accurately detecting such inclusions, and develop two methods for interfacing this approach with a state-of-the-art parser for German. An evaluation on the TIGER corpus shows that our inclusion entity model achieves a performance gain of 4.3 points in F-score over a baseline of no inclusion detection, and even outperforms a parser with access to gold standard part-of-speech tags.

    Optimising Selective Sampling for Bootstrapping Named Entity Recognition

    Training a statistical named entity recognition system in a new domain requires costly manual annotation of large quantities of in-domain data. Active learning promises to reduce the annotation cost by selecting only highly informative data points. This paper is concerned with a real active learning experiment to bootstrap a named entity recognition system for a new domain of radio astronomical abstracts. We evaluate several committee-based metrics for quantifying the disagreement between classifiers built using multiple views, and demonstrate that the choice of metric can be optimised in simulation experiments with existing annotated data from different domains. A final evaluation shows that we gained substantial savings compared to a randomly sampled baseline.
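
    As an illustration of what a committee-based disagreement metric can look like, the sketch below uses vote entropy, a common choice in the active learning literature. The abstract does not say which metrics were evaluated, so vote entropy, the function names, the two-member committee and the labels `SRC` and `O` are all assumptions made for this example.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy of the label votes cast by a committee for one token.

    `votes` holds one predicted label per committee member.
    Returns 0.0 when all members agree; grows with disagreement.
    """
    k = len(votes)
    counts = Counter(votes)
    # sum of p * log(1/p), with p = count / k
    return sum((c / k) * math.log(k / c) for c in counts.values())


def rank_by_disagreement(sentences):
    """Order sentence ids by mean token-level vote entropy, highest first.

    `sentences` maps a sentence id to a list of per-token vote lists.
    """
    scores = {
        sid: sum(vote_entropy(v) for v in tokens) / len(tokens)
        for sid, tokens in sentences.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical votes from two classifiers (two views) on two sentences.
pool = {
    "s1": [["O", "O"], ["O", "O"]],      # full agreement
    "s2": [["SRC", "O"], ["O", "O"]],    # disagreement on the first token
}
ranking = rank_by_disagreement(pool)
```

    Under these assumptions, the sentences at the head of `ranking` are the ones a selective sampling loop would send to annotators first.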

    Grounding Gene Mentions with Respect to Gene Database Identifiers

    We describe our submission for task 1B of the BioCreAtIvE competition, which is concerned with grounding gene mentions with respect to databases of organism gene identifiers. Several approaches to gene identification, lookup, and disambiguation are presented. Results are presented with two possible baseline systems and a discussion of the source of precision and recall errors, as well as an estimate of precision and recall for an organism-specific tagger bootstrapped from gene synonym lists and the task 1B training data.

    The Effect of Predator Culling on Livestock Losses: Caracal Control in Cooper Hunting Club, 1976 - 1981

    This paper investigates the effectiveness of predator culling as a means of reducing livestock losses using hunting club data for Cooper (outside Mossel Bay) for the period 1976 to 1981. Results showed that caracal (Caracal caracal) culling increased subsequent livestock losses when compared to farms where fewer caracals were culled. When controlling for lagged rainfall, remoteness and a proxy for other unobserved farm characteristics, a logit model indicated the marginal effect of culling to be a 17.5% increase in the likelihood of livestock losses during the next year. The corresponding negative binomial model estimated the effect of an additional caracal culled to be a 0.373 unit increase in the number of sheep lost. A lagged rainfall variable was negative and significant in both models. According to the logit results, the marginal millimetre of rain reduced the likelihood of subsequent losses by 1.1%. For the negative binomial model, the marginal effect of rainfall was a reduction in losses of 0.047 of a sheep, which is about a 5% decrease in losses. The average number of livestock lost was 0.94 sheep per farm per year. Distance travelled, used to proxy remoteness, was positive in the negative binomial model and non-significant in the logit model. Lagged livestock losses were not significant in either model. This result is important because it provides support for stricter predator control regulations by showing that livestock farmers are inadvertently harming their own interests through inappropriate culling, a practice which continues to this day.
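
    The negative binomial figures quoted in this abstract can be put on a common scale by dividing each marginal effect by the average loss of 0.94 sheep per farm per year. The short calculation below does only that; the variable names are illustrative, and it assumes that the quoted percentage for rainfall is meant relative to that average.

```python
# Figures taken from the abstract.
mean_losses = 0.94        # average sheep lost per farm per year
rain_effect = -0.047      # change in sheep lost per extra mm of lagged rainfall
cull_effect = 0.373       # change in sheep lost per additional caracal culled

# Marginal effects expressed as a fraction of average losses.
rain_relative = rain_effect / mean_losses
cull_relative = cull_effect / mean_losses

print(f"rainfall: {rain_relative:+.1%} of average losses per mm")
print(f"culling:  {cull_relative:+.1%} of average losses per caracal")
```

    On this reading, one extra millimetre of lagged rainfall corresponds to about 5% fewer losses relative to the average, while one additional caracal culled corresponds to roughly 40% more.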